1 Executive Summary

The aim of the analysis is to build a prediction model that can estimate the average cost for 2 people to stay in an AirBnB in Amsterdam for 4 nights. As such, we analysed data about Airbnb listings in the city and identified predictors that are highly relevant to the price and fitted several models. Those predictors are selected first based on common sense and were analysed in terms of their price expanatory power. We then tested several models in order to find out which could predict the price better. The best model was chosen based on how well it helps to predict the prices given different inputs. The price is then estimated based on the best model. The best model includes variables such as property type, number_of_reviews, review_scores_rating, bathrooms, bedrooms, accommodates, host_is_superhost, neighbourhood_simplified, availability_30 and reviews_per_month. These are consistent with common sense metrics that would lead to increases or decreases in price.

Even though there are many variables in the dataframe, here is a quick description of some of the variables collected

  • price = cost per night

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • longitude , latitude: geographical coordinates to help us locate the listing

  • neighbourhood*: three variables on a few major neighbourhoods in each city

2 Exploratory Data Analysis (EDA)

In the R4DS Exploratory Data Analysis chapter, the authors state:

“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation… EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”

Conduct a thorough EDA. Recall that an EDA involves three things:

  • Looking at the raw values.
    • dplyr::glimpse()
  • Computing summary statistics of the variables of interest, or finding NAs
    • mosaic::favstats()
    • skimr::skim()
  • Creating informative visualizations.
    • ggplot2::ggplot()
      • geom_histogram() or geom_density() for numeric continuous variables
      • geom_bar() or geom_col() for categorical variables
    • GGally::ggpairs() for scaterrlot/correlation matrix
      • Note that you can add transparency to points/density plots in the aes call, for example: aes(colour = gender, alpha = 0.4)

You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:

  • How many variables/columns? How many rows/observations?
  • Which variables are numbers?
  • Which are categorical or factor variables (numeric or character variables with variables that have a fixed and known set of possible values?
  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!

In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.

2.1 Data wrangling

Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.

Notice that some of the price data (price) is given as a character string, e.g., “$176.00”

Since price is a quantitative variable, we need to make sure it is stored as numeric data num in the dataframe. To do so, we will first use readr::parse_number() which drops any non-numeric characters before or after the first number

Use typeof(listing$price) to confirm that price is now stored as a number.

glimpse(listings)
Rows: 16,116
Columns: 74
$ id                                           <dbl> 2818, 20168, 25428, 27886…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021091e+13, 2.021091e+1…
$ last_scraped                                 <date> 2021-09-07, 2021-09-07, …
$ name                                         <chr> "Quiet Garden View Room &…
$ description                                  <chr> "Quiet Garden View Room &…
$ neighborhood_overview                        <chr> "Indische Buurt (\"Indies…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 3159, 59484, 56142, 97647…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Daniel", "Alexander", "J…
$ host_since                                   <date> 2008-09-24, 2009-12-02, …
$ host_location                                <chr> "Amsterdam, Noord-Holland…
$ host_about                                   <chr> "Upon arriving in Amsterd…
$ host_response_time                           <chr> "within an hour", "within…
$ host_response_rate                           <chr> "100%", "100%", "N/A", "8…
$ host_acceptance_rate                         <chr> "100%", "100%", "0%", "10…
$ host_is_superhost                            <lgl> TRUE, FALSE, TRUE, TRUE, …
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Indische Buurt", "Gracht…
$ host_listings_count                          <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_total_listings_count                    <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_verifications                           <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, FALSE, TRUE, …
$ neighbourhood                                <chr> "Amsterdam, North Holland…
$ neighbourhood_cleansed                       <chr> "Oostelijk Havengebied - …
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 52.36435, 52.36407, 52.37…
$ longitude                                    <dbl> 4.94358, 4.89393, 4.88487…
$ property_type                                <chr> "Private room in rental u…
$ room_type                                    <chr> "Private room", "Private …
$ accommodates                                 <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 2…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1.5 shared baths", "1 pr…
$ bedrooms                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ beds                                         <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2…
$ amenities                                    <chr> "[\"Single level home\", …
$ price                                        <chr> "$59.00", "$106.00", "$12…
$ minimum_nights                               <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ maximum_nights                               <dbl> 28, 365, 120, 730, 1825, …
$ minimum_minimum_nights                       <dbl> 3, 1, 7, 2, 2, 2, 90, 4, …
$ maximum_minimum_nights                       <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ minimum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ maximum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ minimum_nights_avg_ntm                       <dbl> 3.0, 1.0, 13.8, 2.0, 2.0,…
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 120, 1125, 18…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 3, 0, 1, 9, 11, 8, 4, 0, …
$ availability_60                              <dbl> 28, 0, 1, 20, 27, 23, 34,…
$ availability_90                              <dbl> 55, 0, 3, 47, 50, 50, 64,…
$ availability_365                             <dbl> 124, 0, 57, 66, 298, 313,…
$ calendar_last_scraped                        <date> 2021-09-07, 2021-09-07, …
$ number_of_reviews                            <dbl> 280, 339, 5, 223, 353, 49…
$ number_of_reviews_ltm                        <dbl> 2, 0, 0, 4, 19, 17, 0, 0,…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 2, 8, 10, 0, 0, …
$ first_review                                 <date> 2013-08-25, 2014-01-17, …
$ last_review                                  <date> 2019-11-21, 2020-03-27, …
$ review_scores_rating                         <dbl> 4.89, 4.44, 5.00, 4.95, 4…
$ review_scores_accuracy                       <dbl> 4.93, 4.69, 5.00, 4.93, 4…
$ review_scores_cleanliness                    <dbl> 5.00, 4.79, 5.00, 4.96, 4…
$ review_scores_checkin                        <dbl> 4.97, 4.63, 5.00, 4.95, 4…
$ review_scores_communication                  <dbl> 4.97, 4.62, 5.00, 4.92, 4…
$ review_scores_location                       <dbl> 4.68, 4.87, 5.00, 4.90, 4…
$ review_scores_value                          <dbl> 4.81, 4.49, 4.80, 4.80, 4…
$ license                                      <chr> "0363 5F3A 5684 6750 D14D…
$ instant_bookable                             <lgl> TRUE, TRUE, FALSE, TRUE, …
$ calculated_host_listings_count               <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 1, 2, 0, 1, 2, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 2.86, 3.64, 0.11, 2.14, 4…
favstats(listings$id)
minQ1medianQ3maxmeansdnmissing
2.82e+031.02e+071.93e+073.11e+075.21e+072.12e+071.35e+07161160
skim(listings)
Data summary
Name listings
Number of rows 16116
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 16116 0
name 30 1.00 1 240 0 15766 0
description 223 0.99 1 1000 0 15711 0
neighborhood_overview 5711 0.65 1 1000 0 9495 0
picture_url 0 1.00 61 126 0 15968 0
host_url 0 1.00 38 43 0 14197 0
host_name 5 1.00 1 33 0 5166 0
host_location 35 1.00 2 86 0 560 0
host_about 7066 0.56 1 8920 0 7522 16
host_response_time 5 1.00 3 18 0 5 0
host_response_rate 5 1.00 2 4 0 57 0
host_acceptance_rate 5 1.00 2 4 0 91 0
host_thumbnail_url 5 1.00 55 106 0 14161 0
host_picture_url 5 1.00 57 109 0 14161 0
host_neighbourhood 5713 0.65 4 35 0 68 0
host_verifications 0 1.00 2 158 0 361 0
neighbourhood 5711 0.65 11 59 0 67 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 67 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 21 1.00 6 17 0 26 0
amenities 0 1.00 2 1488 0 14667 0
price 0 1.00 5 9 0 498 0
license 11561 0.28 6 63 0 3624 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
host_since 5 1.00 2008-09-24 2021-08-31 2015-04-07 3235
calendar_last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
first_review 2087 0.87 2011-04-19 2021-09-07 2017-10-30 2467
last_review 2087 0.87 2011-06-12 2021-09-07 2019-06-10 2019

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.12 FAL: 14103, TRU: 2008
host_has_profile_pic 5 1 1.00 TRU: 16078, FAL: 33
host_identity_verified 5 1 0.67 TRU: 10720, FAL: 5391
neighbourhood_group_cleansed 16116 0 NaN :
bathrooms 16116 0 NaN :
calendar_updated 16116 0 NaN :
has_availability 0 1 0.96 TRU: 15457, FAL: 659
instant_bookable 0 1 0.23 FAL: 12408, TRU: 3708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.118184e+07 13520629.26 2.818000e+03 1.018020e+07 1.926593e+07 3.107548e+07 5.208280e+07 ▇▇▆▅▂
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 ▁▁▇▁▁
host_id 0 1.00 6.976052e+07 90271635.64 3.159000e+03 9.735558e+06 2.974134e+07 8.988321e+07 4.210037e+08 ▇▂▁▁▁
host_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
host_total_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
latitude 0 1.00 5.237000e+01 0.02 5.229000e+01 5.236000e+01 5.236000e+01 5.238000e+01 5.243000e+01 ▁▁▇▅▁
longitude 0 1.00 4.890000e+00 0.04 4.760000e+00 4.860000e+00 4.890000e+00 4.910000e+00 5.070000e+00 ▁▆▇▁▁
accommodates 0 1.00 2.840000e+00 1.31 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▃▁▁▁
bedrooms 898 0.94 1.530000e+00 0.95 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+01 ▇▁▁▁▁
beds 97 0.99 1.760000e+00 1.47 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.300000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 3.990000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.081500e+02 540.67 1.000000e+00 2.100000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▇▁▁▇▁
minimum_minimum_nights 3 1.00 3.950000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 3 1.00 4.150000e+00 21.03 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 3 1.00 6.855200e+02 532.48 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▆▁▁▇▁
maximum_maximum_nights 3 1.00 2.672465e+05 23924509.45 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 3 1.00 4.030000e+00 21.00 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 3 1.00 2.666400e+05 23870381.65 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.142625e+09 ▇▁▁▁▁
availability_30 0 1.00 4.030000e+00 8.70 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+01 ▇▁▁▁▁
availability_60 0 1.00 8.840000e+00 18.14 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 6.000000e+01 ▇▁▁▁▁
availability_90 0 1.00 1.432000e+01 28.51 0.000000e+00 0.000000e+00 0.000000e+00 5.000000e+00 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 5.532000e+01 107.91 0.000000e+00 0.000000e+00 0.000000e+00 4.700000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 2.465000e+01 56.71 0.000000e+00 2.000000e+00 8.000000e+00 2.200000e+01 8.770000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 1.390000e+00 7.48 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.220000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 3.300000e-01 2.03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.370000e+02 ▇▁▁▁▁
review_scores_rating 2087 0.87 4.690000e+00 0.67 0.000000e+00 4.670000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 2301 0.86 4.810000e+00 0.35 0.000000e+00 4.750000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 2300 0.86 4.700000e+00 0.44 0.000000e+00 4.600000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 2309 0.86 4.850000e+00 0.31 0.000000e+00 4.820000e+00 4.950000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 2304 0.86 4.870000e+00 0.30 0.000000e+00 4.850000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 2309 0.86 4.730000e+00 0.33 0.000000e+00 4.600000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 2309 0.86 4.600000e+00 0.39 0.000000e+00 4.500000e+00 4.670000e+00 4.830000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.640000e+00 2.40 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.100000e+00 1.87 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 4.900000e-01 1.53 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.13 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 ▇▁▁▁▁
reviews_per_month 2087 0.87 6.800000e-01 1.72 1.000000e-02 1.100000e-01 2.700000e-01 6.200000e-01 1.078400e+02 ▇▁▁▁▁
#Storing the price as a number
listings_clean <- listings %>% 
  filter(name != "",
         host_location != "") %>% 
  mutate(price = parse_number(price),
         bathrooms = parse_number(gsub( "\\s.*", "", bathrooms_text))) %>% 
  filter(bathrooms != "")

typeof(listings_clean$price)
[1] "double"
skim(listings_clean)
Data summary
Name listings_clean
Number of rows 15957
Number of columns 74
_______________________
Column type frequency:
character 23
Date 5
logical 7
numeric 39
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 15957 0
name 0 1.00 1 240 0 15647 0
description 205 0.99 1 1000 0 15578 0
neighborhood_overview 5628 0.65 1 1000 0 9437 0
picture_url 0 1.00 61 126 0 15809 0
host_url 0 1.00 38 43 0 14073 0
host_name 0 1.00 1 33 0 5133 0
host_location 0 1.00 2 86 0 555 0
host_about 6954 0.56 1 8920 0 7489 16
host_response_time 0 1.00 3 18 0 5 0
host_response_rate 0 1.00 2 4 0 57 0
host_acceptance_rate 0 1.00 2 4 0 91 0
host_thumbnail_url 0 1.00 55 106 0 14046 0
host_picture_url 0 1.00 57 109 0 14046 0
host_neighbourhood 5655 0.65 4 35 0 68 0
host_verifications 0 1.00 2 158 0 360 0
neighbourhood 5628 0.65 11 59 0 66 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 66 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 0 1.00 6 16 0 23 0
amenities 0 1.00 2 1488 0 14532 0
license 11441 0.28 6 63 0 3609 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
host_since 0 1.00 2008-09-24 2021-08-31 2015-04-02 3231
calendar_last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
first_review 2050 0.87 2011-04-19 2021-09-07 2017-11-01 2462
last_review 2050 0.87 2011-06-12 2021-09-07 2019-06-10 2015

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.12 FAL: 13969, TRU: 1988
host_has_profile_pic 0 1 1.00 TRU: 15928, FAL: 29
host_identity_verified 0 1 0.67 TRU: 10620, FAL: 5337
neighbourhood_group_cleansed 15957 0 NaN :
calendar_updated 15957 0 NaN :
has_availability 0 1 0.96 TRU: 15315, FAL: 642
instant_bookable 0 1 0.23 FAL: 12281, TRU: 3676

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.118191e+07 13504046.15 2.818000e+03 1.019025e+07 1.930569e+07 3.106922e+07 5.208280e+07 ▇▇▆▅▂
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 ▁▁▇▁▁
host_id 0 1.00 6.963628e+07 90162614.81 3.159000e+03 9.667780e+06 2.956976e+07 9.006509e+07 4.210037e+08 ▇▂▁▁▁
host_listings_count 0 1.00 2.020000e+00 23.45 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
host_total_listings_count 0 1.00 2.020000e+00 23.45 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
latitude 0 1.00 5.237000e+01 0.02 5.229000e+01 5.236000e+01 5.236000e+01 5.238000e+01 5.243000e+01 ▁▁▇▅▁
longitude 0 1.00 4.890000e+00 0.04 4.760000e+00 4.860000e+00 4.890000e+00 4.910000e+00 5.070000e+00 ▁▆▇▁▁
accommodates 0 1.00 2.840000e+00 1.31 1.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▁▁▁▁
bathrooms 0 1.00 1.200000e+00 0.41 0.000000e+00 1.000000e+00 1.000000e+00 1.500000e+00 1.300000e+01 ▇▁▁▁▁
bedrooms 866 0.95 1.530000e+00 0.95 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+01 ▇▁▁▁▁
beds 80 0.99 1.760000e+00 1.47 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.300000e+01 ▇▁▁▁▁
price 0 1.00 1.573700e+02 173.87 4.000000e+00 9.800000e+01 1.300000e+02 1.800000e+02 8.000000e+03 ▇▁▁▁▁
minimum_nights 0 1.00 3.970000e+00 21.01 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.077800e+02 540.94 1.000000e+00 2.100000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▇▁▁▇▁
minimum_minimum_nights 0 1.00 3.930000e+00 21.01 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 4.130000e+00 21.06 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 6.849200e+02 532.67 1.000000e+00 2.700000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▆▁▁▇▁
maximum_maximum_nights 0 1.00 2.698519e+05 24041164.00 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 4.000000e+00 21.02 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 2.692394e+05 23986772.28 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.142625e+09 ▇▁▁▁▁
availability_30 0 1.00 4.010000e+00 8.68 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+01 ▇▁▁▁▁
availability_60 0 1.00 8.810000e+00 18.10 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 6.000000e+01 ▇▁▁▁▁
availability_90 0 1.00 1.427000e+01 28.45 0.000000e+00 0.000000e+00 0.000000e+00 5.000000e+00 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 5.506000e+01 107.55 0.000000e+00 0.000000e+00 0.000000e+00 4.600000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 2.462000e+01 56.74 0.000000e+00 2.000000e+00 8.000000e+00 2.200000e+01 8.770000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 1.380000e+00 7.47 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.220000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 3.200000e-01 2.01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.370000e+02 ▇▁▁▁▁
review_scores_rating 2050 0.87 4.690000e+00 0.67 0.000000e+00 4.670000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 2261 0.86 4.810000e+00 0.35 0.000000e+00 4.750000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 2260 0.86 4.700000e+00 0.44 0.000000e+00 4.600000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 2269 0.86 4.850000e+00 0.31 0.000000e+00 4.820000e+00 4.950000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 2264 0.86 4.870000e+00 0.30 0.000000e+00 4.850000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 2269 0.86 4.730000e+00 0.33 0.000000e+00 4.600000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 2269 0.86 4.600000e+00 0.39 0.000000e+00 4.500000e+00 4.670000e+00 4.830000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.630000e+00 2.37 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.100000e+00 1.88 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 4.700000e-01 1.46 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.14 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 ▇▁▁▁▁
reviews_per_month 2050 0.87 6.800000e-01 1.72 1.000000e-02 1.100000e-01 2.700000e-01 6.200000e-01 1.078400e+02 ▇▁▁▁▁
#explore the distribution of bedrooms and beds
ggplot(listings_clean) +
  geom_density(aes(x=bedrooms))+
  theme_bw()

ggplot(listings_clean) +
  geom_density(aes(x=beds))+
  theme_bw()

ggplot(listings_clean) +
  geom_density(aes(x=price))+
  theme_bw()

> We noticed that for each variable there are a lot of outliers outside where the majority data concentrate and we would like to remove those uncommon observations later.

# assumptions of the followings variables are made in accordance to common choices for tourists
listings_clean1 <- listings_clean %>% filter(bedrooms < 5) 
listings_clean3 <- listings_clean1 %>% filter (minimum_nights <= 4)
listings_clean4 <- listings_clean3 %>% filter (price < 1500) # Everything above that would be extraordinarily expensive
listings_clean5 <- listings_clean4 %>% filter (accommodates > 1)

# cleaning the data and choosing the relevant variables
listings_smaller <- listings_clean5 %>% 
  select(host_location,
         neighbourhood,
         neighbourhood_cleansed,
         neighbourhood_group_cleansed,
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bathrooms_text,
         bedrooms,
         beds,
         price,
         review_scores_rating,
         minimum_nights,
         maximum_nights,
         property_type,
         latitude,
         longitude,
         host_is_superhost,
         instant_bookable,
         availability_30,
         reviews_per_month)

2.2 Key Explanatory Variables

# investigate the correlation among some key variables
listings_smaller %>% 
  select(price, bedrooms, beds, bathrooms, review_scores_rating) %>% 
  ggpairs(aes(alpha = 0.4))

#box plot for number of beds (1 and 2) relative to prices
ggplot(listings_smaller, aes(x = price)) +
  geom_boxplot() +
  facet_wrap(~bedrooms)

#comparison of review scores and prices
ggplot(listings_smaller, aes(x = review_scores_rating, y = price)) +
  geom_point(aes(alpha = 0.4))

Plot 1 showcases that 2 bedrooms are relatively more (but only slightly) expensive than 1 bedroom in Amsterdam; however the range and variability of prices of 2 bedrooom rooms appears to be greater than a 1 bedroom, with the 75% percentile being significantly higher for 2 bedrooms compared to 1. The moderately strong positive correlation of 0.378 between bedrooms and price also indicates the same. Similarly, the plots also indicate that Higher ratings is also associated with higher prices, even though the correlation between the two variables is only weakly positive.

0 review ratings do not tell us much about the quality of the accomodation nor the price. It can be explained by that new listing usually don’t have reviews and 0 ratings are therefore not informative

Next, we look at the variable property_type. We can use the count function to determine how many categories there are their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.

# property type
listings_clean5 %>% 
  group_by(property_type) %>%
  summarise(count = n()) %>% 
  arrange(desc(count))
property_typecount
Entire rental unit7806
Private room in rental unit1406
Entire residential home794
Entire townhouse325
Private room in bed and breakfast270
Entire condominium (condo)261
Private room in residential home234
Entire loft199
Houseboat157
Boat126
Room in boutique hotel125
Private room in townhouse122
Private room in houseboat95
Private room in guest suite91
Room in hotel83
Private room in boat72
Entire serviced apartment69
Private room in condominium (condo)67
Private room in loft44
Room in bed and breakfast31
Entire guest suite25
Private room in hostel21
Entire villa17
Room in serviced apartment16
Shared room in rental unit16
Entire guesthouse12
Private room12
Room in aparthotel12
Private room in guesthouse11
Private room in serviced apartment10
Private room in farm stay7
Private room in villa6
Entire cabin5
Entire cottage5
Entire place5
Room in hostel5
Tiny house5
Barn4
Entire bed and breakfast4
Entire bungalow4
Private room in casa particular4
Private room in tiny house4
Entire home/apt3
Shared room in houseboat3
Camper/RV2
Entire chalet2
Floor2
Private room in cabin2
Private room in dome house2
Private room in floor2
Shared room in bed and breakfast2
Shared room in residential home2
Tower2
Campsite1
Earth house1
Private room in bungalow1
Private room in earth house1
Private room in island1
Private room in nature lodge1
Shared room in boat1
Shared room in hostel1
#creating a simplified version of `property_type`
listings_clean5 <- listings_clean5 %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Entire residential home", "Entire townhouse","Private room in rental unit") ~ property_type, 
    TRUE ~ "Other"
  ))

As shown in the table above, the four most common property types in Airbnb Amsterdam are “Entire rental unit”, “private room in rental unit”, “entire condo” and “entire residential home”. Collectively they account for the majority of the property types available in Amsterdam. The significantly high number of row houses, condos and houses with lofts relative to other major cities like Rotterdam (more metropolitan with numerous skyscapers) also justifies the relatively high number of condos and lofts.

Use the code below to check that prop_type_simplified was correctly made.

listings_clean5 %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))        
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit7806
Private room in rental unitPrivate room in rental unit1406
Entire residential homeEntire residential home794
Entire townhouseEntire townhouse325
Private room in bed and breakfastOther270
Entire condominium (condo)Other261
Private room in residential homeOther234
Entire loftOther199
HouseboatOther157
BoatOther126
Room in boutique hotelOther125
Private room in townhouseOther122
Private room in houseboatOther95
Private room in guest suiteOther91
Room in hotelOther83
Private room in boatOther72
Entire serviced apartmentOther69
Private room in condominium (condo)Other67
Private room in loftOther44
Room in bed and breakfastOther31
Entire guest suiteOther25
Private room in hostelOther21
Entire villaOther17
Room in serviced apartmentOther16
Shared room in rental unitOther16
Entire guesthouseOther12
Private roomOther12
Room in aparthotelOther12
Private room in guesthouseOther11
Private room in serviced apartmentOther10
Private room in farm stayOther7
Private room in villaOther6
Entire cabinOther5
Entire cottageOther5
Entire placeOther5
Room in hostelOther5
Tiny houseOther5
BarnOther4
Entire bed and breakfastOther4
Entire bungalowOther4
Private room in casa particularOther4
Private room in tiny houseOther4
Entire home/aptOther3
Shared room in houseboatOther3
Camper/RVOther2
Entire chaletOther2
FloorOther2
Private room in cabinOther2
Private room in dome houseOther2
Private room in floorOther2
Shared room in bed and breakfastOther2
Shared room in residential homeOther2
TowerOther2
CampsiteOther1
Earth houseOther1
Private room in bungalowOther1
Private room in earth houseOther1
Private room in islandOther1
Private room in nature lodgeOther1
Shared room in boatOther1
Shared room in hostelOther1

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?
  • Is there any value among the common values that stands out?
  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
listings %>%
  count(minimum_nights) %>%
  arrange(desc(n))  
minimum_nightsn
2      5993
3      3918
1      2658
4      1367
5      895
7      481
6      234
14      109
10      92
30      63
8      33
21      32
20      30
25      19
12      18
15      18
28      16
60      15
9      12
90      11
13      7
59      7
180      7
11      4
19      4
23      4
29      4
50      4
58      4
100      4
16      3
27      3
31      3
200      3
300      3
365      3
18      2
26      2
42      2
45      2
99      2
150      2
1e+03      2
24      1
33      1
37      1
38      1
63      1
70      1
75      1
80      1
89      1
94      1
95      1
120      1
181      1
183      1
186      1
222      1
240      1
500      1
999      1
1e+03      1
1.1e+031
# Filtered out everything above 4 min. nights in listings_clean
ggplot(listings_clean5, aes(x=minimum_nights)) + 
  geom_histogram()

The most common values appearing for the variable “minimum nights” are 1,2 and 3 nights. Out of these 3 values, 2 nights is by far the most commonly occuring value. Given the fact that Amsterdam is amongst one of the top cities for a weekend visit by tourists residing in UK and western Europe, a minimum of 2 nights is seems logical. Similarly, 30 nights is the 9th highest with it being only 1 of 2 values to be greater than 10. The fact that any reservation >29 days is considered as a “long-term stay” and the host can earn significant benefits from the same could explain why a minimum of 30 nights is so common.

Filter the airbnb data so that it only includes observations with minimum_nights <= 4

3 Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet

leaflet(data = filter(listings_clean5)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

The concentration of the listings are around central amsterdam and south-east Amsterdam. This is an expected result given that central and south-east amsterdam tend to the key tourist spots like the “nine streets” and thus justifies the greater concentration of listings in this areas.

4 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?

Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

  • Interpret the coefficient review_scores_rating in terms of price_4_nights.
  • Interpret the coefficient of prop_type_simplified in terms of price_4_nights.

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explananatory variables in model1 plus room_type.

listings_clean5 <- listings_clean5 %>% 
  mutate(price_4_nights = price * 4) %>%  # we did not include accomodates because the given price is already including all people
  filter(review_scores_rating > 0) # we exclude 0's because new listing usually don't have reviews and are therefore not informative
  

ggplot(listings_clean5, aes(x = price_4_nights)) +
  geom_density()

ggplot(listings_clean5, aes(x = log(price_4_nights))) +
  geom_density()

4.1 Model 1

model1 <- lm(log(price_4_nights) ~ prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             data = listings_clean5)

summary(model1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5924 -0.2978 -0.0340  0.2566  2.2791 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.8030257  0.0606399  95.696
prop_type_simplifiedEntire residential home      0.1895242  0.0179221  10.575
prop_type_simplifiedEntire townhouse             0.2700822  0.0267483  10.097
prop_type_simplifiedOther                       -0.0760105  0.0116508  -6.524
prop_type_simplifiedPrivate room in rental unit -0.5310931  0.0140631 -37.765
number_of_reviews                               -0.0005345  0.0000725  -7.372
review_scores_rating                             0.1174017  0.0126436   9.285
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                       7.14e-11 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                               1.80e-13 ***
review_scores_rating                             < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4416 on 11049 degrees of freedom
Multiple R-squared:  0.1714,    Adjusted R-squared:  0.171 
F-statistic: 380.9 on 6 and 11049 DF,  p-value: < 2.2e-16

4.2 Model 2

model2 <- lm(log(price_4_nights) ~ prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               room_type,
             data = listings_clean5)

summary(model2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5892 -0.2860 -0.0347  0.2469  2.3326 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.931e+00  5.876e-02 100.931
prop_type_simplifiedEntire residential home      1.914e-01  1.731e-02  11.053
prop_type_simplifiedEntire townhouse             2.728e-01  2.584e-02  10.556
prop_type_simplifiedOther                        2.384e-01  1.603e-02  14.876
prop_type_simplifiedPrivate room in rental unit  1.688e-02  2.388e-02   0.707
number_of_reviews                               -1.873e-04  7.118e-05  -2.632
review_scores_rating                             8.942e-02  1.226e-02   7.297
room_typeHotel room                             -2.502e-01  5.595e-02  -4.472
room_typePrivate room                           -5.660e-01  2.029e-02 -27.898
room_typeShared room                            -6.546e-01  9.434e-02  -6.939
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  0.47951    
number_of_reviews                                0.00851 ** 
review_scores_rating                            3.15e-13 ***
room_typeHotel room                             7.84e-06 ***
room_typePrivate room                            < 2e-16 ***
room_typeShared room                            4.18e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4266 on 11046 degrees of freedom
Multiple R-squared:  0.2269,    Adjusted R-squared:  0.2263 
F-statistic: 360.3 on 9 and 11046 DF,  p-value: < 2.2e-16
vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.176895  4        1.195657
number_of_reviews    1.147273  1        1.071108
review_scores_rating 1.016980  1        1.008455
room_type            4.304831  3        1.275438

The variable (log(price_4_nights)) has been used as it effectively captures the compunding effect; it also allows for an earier and more meaningful interpretation of the regression results (in percentage terms). This is also showcased by the density plots; the variable price_4_nights is right skewed whereas the logged variable reflects a normal distribution bell curve.

4.3 Further variables/questions to explore on our own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?
  2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
  3. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
  4. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nights
  5. What is the effect of avalability_30 or reviews_per_month on price_4_nights, after we control for other variables?

4.4 Model 3

model3 <- lm(log(price_4_nights) ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               # room_type + GVIF higher than 5
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost +
               instant_bookable,
             data = listings_clean5)

summary(model3)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bathrooms + bedrooms + beds + accommodates + 
    host_is_superhost + instant_bookable, data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4023 -0.2356 -0.0085  0.2225  2.2901 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.103e+00  5.576e-02  91.516
prop_type_simplifiedEntire residential home     -1.263e-01  1.668e-02  -7.570
prop_type_simplifiedEntire townhouse            -4.733e-02  2.415e-02  -1.960
prop_type_simplifiedOther                       -7.893e-02  1.057e-02  -7.468
prop_type_simplifiedPrivate room in rental unit -4.165e-01  1.255e-02 -33.196
number_of_reviews                               -6.000e-04  6.596e-05  -9.095
review_scores_rating                             1.180e-01  1.124e-02  10.497
bathrooms                                        1.322e-01  1.105e-02  11.957
bedrooms                                         1.180e-01  8.536e-03  13.826
beds                                             9.030e-03  5.463e-03   1.653
accommodates                                     1.245e-01  5.462e-03  22.798
host_is_superhostTRUE                            7.001e-02  1.149e-02   6.095
instant_bookableTRUE                             6.228e-03  9.236e-03   0.674
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home     4.02e-14 ***
prop_type_simplifiedEntire townhouse              0.0500 .  
prop_type_simplifiedOther                       8.77e-14 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                                < 2e-16 ***
review_scores_rating                             < 2e-16 ***
bathrooms                                        < 2e-16 ***
bedrooms                                         < 2e-16 ***
beds                                              0.0984 .  
accommodates                                     < 2e-16 ***
host_is_superhostTRUE                           1.13e-09 ***
instant_bookableTRUE                              0.5002    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3832 on 11022 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.3752,    Adjusted R-squared:  0.3745 
F-statistic: 551.6 on 12 and 11022 DF,  p-value: < 2.2e-16
vif(model3)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.569514  4        1.057963
number_of_reviews    1.220562  1        1.104790
review_scores_rating 1.049764  1        1.024580
bathrooms            1.189945  1        1.090846
bedrooms             2.980894  1        1.726527
beds                 2.899654  1        1.702837
accommodates         2.784538  1        1.668693
host_is_superhost    1.216296  1        1.102858
instant_bookable     1.099468  1        1.048555
listings_clean5 <- listings_clean5 %>% 
  mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Oostelijk Havengebied - Indische Buurt",
                                                                             "Oud-Oost",
                                                                             "Watergraafsmeer",
                                                                             "Noord-Oost",
                                                                             "IJburg - Zeeburgereiland",
                                                                             "IJburg - Zeeburgereiland",
                                                                             "Gaasperdam - Driemond",
                                                                             "Bijlmer-Oost")~"East",
                                              neighbourhood_cleansed %in% c("Centrum-Oost",
                                                                            "Centrum-West",
                                                                            "De Pijp - Rivierenbuurt",
                                                                            "Bijlmer-Centrum")~"Center",
                                              neighbourhood_cleansed %in% c("De Baarsjes - Oud-West",
                                                                            "Westerpark",
                                                                            "Slotervaart",
                                                                            "Bos en Lommer",
                                                                            "De Aker - Nieuw Sloten",
                                                                            "Osdorp",
                                                                            "Geuzenveld - Slotermeer")~"West",
                                              neighbourhood_cleansed %in% c("Zuid",
                                                                            "Buitenveldert - Zuidas")~"South",
                                              neighbourhood_cleansed %in% c("Noord-West",
                                                                            "Oud-Noord")~"North"))
                                                
                                              
unique(listings_clean$neighbourhood_simplified)
NULL

4.5 Model 4

model4 <- lm(log(price_4_nights) ~ neighbourhood_simplified, data = listings_clean5)

summary(model4)

Call:
lm(formula = log(price_4_nights) ~ neighbourhood_simplified, 
    data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6384 -0.3080 -0.0140  0.2786  2.2069 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    6.41097    0.00806 795.435  < 2e-16 ***
neighbourhood_simplifiedEast  -0.18461    0.01335 -13.831  < 2e-16 ***
neighbourhood_simplifiedNorth -0.21179    0.02186  -9.688  < 2e-16 ***
neighbourhood_simplifiedSouth -0.08287    0.01751  -4.732 2.25e-06 ***
neighbourhood_simplifiedWest  -0.18407    0.01102 -16.705  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4774 on 11051 degrees of freedom
Multiple R-squared:  0.03132,   Adjusted R-squared:  0.03097 
F-statistic: 89.32 on 4 and 11051 DF,  p-value: < 2.2e-16

4.6 Model 5

model5 <- lm(log(price_4_nights) ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               # room_type + GVIF higher than 5
               bathrooms +
               bedrooms +
               beds +
               accommodates +
               host_is_superhost +
               instant_bookable +
               neighbourhood_simplified +
               availability_30 +
               reviews_per_month,
             data = listings_clean5)

summary(model5)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bathrooms + bedrooms + beds + accommodates + 
    host_is_superhost + instant_bookable + neighbourhood_simplified + 
    availability_30 + reviews_per_month, data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5230 -0.2182 -0.0018  0.2091  2.4078 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.188e+00  5.215e-02  99.489
prop_type_simplifiedEntire residential home     -4.658e-02  1.616e-02  -2.882
prop_type_simplifiedEntire townhouse            -1.639e-02  2.257e-02  -0.726
prop_type_simplifiedOther                       -1.315e-01  1.024e-02 -12.852
prop_type_simplifiedPrivate room in rental unit -4.428e-01  1.177e-02 -37.638
number_of_reviews                               -7.300e-04  6.843e-05 -10.668
review_scores_rating                             1.288e-01  1.048e-02  12.291
bathrooms                                        1.215e-01  1.031e-02  11.784
bedrooms                                         1.372e-01  7.983e-03  17.193
beds                                             6.232e-03  5.098e-03   1.222
accommodates                                     1.150e-01  5.102e-03  22.535
host_is_superhostTRUE                            5.013e-02  1.075e-02   4.664
instant_bookableTRUE                            -8.320e-03  8.669e-03  -0.960
neighbourhood_simplifiedEast                    -2.519e-01  1.016e-02 -24.797
neighbourhood_simplifiedNorth                   -3.391e-01  1.707e-02 -19.870
neighbourhood_simplifiedSouth                   -1.323e-01  1.318e-02 -10.036
neighbourhood_simplifiedWest                    -1.799e-01  8.376e-03 -21.474
availability_30                                  1.147e-02  4.384e-04  26.162
reviews_per_month                               -1.232e-02  2.258e-03  -5.456
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      0.00396 ** 
prop_type_simplifiedEntire townhouse             0.46792    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                                < 2e-16 ***
review_scores_rating                             < 2e-16 ***
bathrooms                                        < 2e-16 ***
bedrooms                                         < 2e-16 ***
beds                                             0.22156    
accommodates                                     < 2e-16 ***
host_is_superhostTRUE                           3.14e-06 ***
instant_bookableTRUE                             0.33721    
neighbourhood_simplifiedEast                     < 2e-16 ***
neighbourhood_simplifiedNorth                    < 2e-16 ***
neighbourhood_simplifiedSouth                    < 2e-16 ***
neighbourhood_simplifiedWest                     < 2e-16 ***
availability_30                                  < 2e-16 ***
reviews_per_month                               4.99e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3572 on 11016 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.4575,    Adjusted R-squared:  0.4566 
F-statistic:   516 on 18 and 11016 DF,  p-value: < 2.2e-16
vif(model5)
                             GVIF Df GVIF^(1/(2*Df))
prop_type_simplified     1.827265  4        1.078264
number_of_reviews        1.511856  1        1.229575
review_scores_rating     1.051581  1        1.025466
bathrooms                1.192647  1        1.092084
bedrooms                 3.000618  1        1.732229
beds                     2.906443  1        1.704829
accommodates             2.796295  1        1.672212
host_is_superhost        1.225863  1        1.107187
instant_bookable         1.114740  1        1.055813
neighbourhood_simplified 1.162419  4        1.018991
availability_30          1.165880  1        1.079759
reviews_per_month        1.424323  1        1.193450
autoplot(model5)

4.7 Model 6

Removed variable beds due to high correlation with bedrooms and variable instant_bookable which is insignificant in model 5.

model6 <- lm(log(price_4_nights) ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               # room_type + GVIF higher than 5
               bathrooms +
               bedrooms +
               accommodates +
               host_is_superhost +
               neighbourhood_simplified +
               availability_30 +
               reviews_per_month,
             data = listings_clean5)

summary(model6)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bathrooms + bedrooms + accommodates + 
    host_is_superhost + neighbourhood_simplified + availability_30 + 
    reviews_per_month, data = listings_clean5)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5216 -0.2174 -0.0021  0.2091  2.4143 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.170e+00  5.154e-02 100.311
prop_type_simplifiedEntire residential home     -4.574e-02  1.614e-02  -2.834
prop_type_simplifiedEntire townhouse            -1.552e-02  2.256e-02  -0.688
prop_type_simplifiedOther                       -1.344e-01  1.006e-02 -13.360
prop_type_simplifiedPrivate room in rental unit -4.443e-01  1.172e-02 -37.909
number_of_reviews                               -7.229e-04  6.839e-05 -10.571
review_scores_rating                             1.315e-01  1.040e-02  12.641
bathrooms                                        1.218e-01  1.032e-02  11.808
bedrooms                                         1.413e-01  7.359e-03  19.208
accommodates                                     1.178e-01  4.505e-03  26.158
host_is_superhostTRUE                            4.923e-02  1.072e-02   4.593
neighbourhood_simplifiedEast                    -2.517e-01  1.015e-02 -24.802
neighbourhood_simplifiedNorth                   -3.384e-01  1.707e-02 -19.819
neighbourhood_simplifiedSouth                   -1.315e-01  1.318e-02  -9.976
neighbourhood_simplifiedWest                    -1.799e-01  8.366e-03 -21.500
availability_30                                  1.150e-02  4.365e-04  26.333
reviews_per_month                               -1.248e-02  2.248e-03  -5.549
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home       0.0046 ** 
prop_type_simplifiedEntire townhouse              0.4913    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                                < 2e-16 ***
review_scores_rating                             < 2e-16 ***
bathrooms                                        < 2e-16 ***
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
host_is_superhostTRUE                           4.41e-06 ***
neighbourhood_simplifiedEast                     < 2e-16 ***
neighbourhood_simplifiedNorth                    < 2e-16 ***
neighbourhood_simplifiedSouth                    < 2e-16 ***
neighbourhood_simplifiedWest                     < 2e-16 ***
availability_30                                  < 2e-16 ***
reviews_per_month                               2.95e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3575 on 11039 degrees of freedom
Multiple R-squared:  0.4576,    Adjusted R-squared:  0.4568 
F-statistic:   582 on 16 and 11039 DF,  p-value: < 2.2e-16

4.8 Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using autoplot(model_x)
  2. As you start building models with more explanatory variables, make sure you use `car::vif(model_x)`` to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 5 or 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
huxreg(model1, model2, model3, model4, model5, model6)
(1)(2)(3)(4)(5)(6)
(Intercept)5.803 ***5.931 ***5.103 ***6.411 ***5.188 ***5.170 ***
(0.061)   (0.059)   (0.056)   (0.008)   (0.052)   (0.052)   
prop_type_simplifiedEntire residential home0.190 ***0.191 ***-0.126 ***        -0.047 ** -0.046 ** 
(0.018)   (0.017)   (0.017)           (0.016)   (0.016)   
prop_type_simplifiedEntire townhouse0.270 ***0.273 ***-0.047            -0.016    -0.016    
(0.027)   (0.026)   (0.024)           (0.023)   (0.023)   
prop_type_simplifiedOther-0.076 ***0.238 ***-0.079 ***        -0.132 ***-0.134 ***
(0.012)   (0.016)   (0.011)           (0.010)   (0.010)   
prop_type_simplifiedPrivate room in rental unit-0.531 ***0.017    -0.417 ***        -0.443 ***-0.444 ***
(0.014)   (0.024)   (0.013)           (0.012)   (0.012)   
number_of_reviews-0.001 ***-0.000 ** -0.001 ***        -0.001 ***-0.001 ***
(0.000)   (0.000)   (0.000)           (0.000)   (0.000)   
review_scores_rating0.117 ***0.089 ***0.118 ***        0.129 ***0.131 ***
(0.013)   (0.012)   (0.011)           (0.010)   (0.010)   
room_typeHotel room        -0.250 ***                                
        (0.056)                                   
room_typePrivate room        -0.566 ***                                
        (0.020)                                   
room_typeShared room        -0.655 ***                                
        (0.094)                                   
bathrooms                0.132 ***        0.122 ***0.122 ***
                (0.011)           (0.010)   (0.010)   
bedrooms                0.118 ***        0.137 ***0.141 ***
                (0.009)           (0.008)   (0.007)   
beds                0.009            0.006            
                (0.005)           (0.005)           
accommodates                0.125 ***        0.115 ***0.118 ***
                (0.005)           (0.005)   (0.005)   
host_is_superhostTRUE                0.070 ***        0.050 ***0.049 ***
                (0.011)           (0.011)   (0.011)   
instant_bookableTRUE                0.006            -0.008            
                (0.009)           (0.009)           
neighbourhood_simplifiedEast                        -0.185 ***-0.252 ***-0.252 ***
                        (0.013)   (0.010)   (0.010)   
neighbourhood_simplifiedNorth                        -0.212 ***-0.339 ***-0.338 ***
                        (0.022)   (0.017)   (0.017)   
neighbourhood_simplifiedSouth                        -0.083 ***-0.132 ***-0.131 ***
                        (0.018)   (0.013)   (0.013)   
neighbourhood_simplifiedWest                        -0.184 ***-0.180 ***-0.180 ***
                        (0.011)   (0.008)   (0.008)   
availability_30                                0.011 ***0.011 ***
                                (0.000)   (0.000)   
reviews_per_month                                -0.012 ***-0.012 ***
                                (0.002)   (0.002)   
N11056        11056        11035        11056        11035        11056        
R20.171    0.227    0.375    0.031    0.457    0.458    
logLik-6647.689    -6264.323    -5067.142    -7511.196    -4288.503    -4305.492    
AIC13311.379    12550.645    10162.285    15034.393    8617.006    8646.983    
*** p < 0.001; ** p < 0.01; * p < 0.05.
# Prediction using model6; model 6 has by far the highest R^2 and is less prone to collinearity than model5

listings_selection <- listings_clean5 %>% 
  filter(prop_type_simplified == "Private room in rental unit",
         number_of_reviews >= 10,
         review_scores_rating / 5 >= 0.9)

set.seed(1234)

train_test_split <- initial_split(listings_selection, prop = 0.75)
listings_train <- training(train_test_split)
listings_test <- testing(train_test_split)

rmse_train <- listings_train %>% 
  mutate(predictions = predict(model6, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  pull()

rmse_train # 0.05610197
[1] 0.1877019
rmse_test <- listings_test %>% 
  mutate(predictions = predict(model6, .)) %>% 
  summarise(sqrt(sum(predictions - log(price_4_nights))**2/n())) %>% 
  pull()

rmse_test # 0.2981508
[1] 0.8266823
listings_specifications <- tibble(prop_type_simplified = "Private room in rental unit",
               number_of_reviews = 10,
               review_scores_rating = 0.9 * 5,
               # room_type + GVIF higher than 5
               bathrooms = 2,
               bedrooms = 2,
               accommodates = 4,
               host_is_superhost = TRUE,
               neighbourhood_simplified = "Center",
               availability_30 = 4,
               reviews_per_month = 5)

exp(predict(model6, newdata = listings_specifications, interval = "confidence"))
       fit      lwr      upr
1 567.2637 545.6611 589.7215
# 95% CI for price for our specifications between $545.66 and $589.72 for our trip

Regression Analysis From the Analysis above we can see that model6 is the best model to use out of the 6 models. Model6 explains us the highest variance of 45.8% in Prices while the rest are much below that with Model3 explaining 37.5% of the variance , Model5 45.7%, Model2 22.7%, Model1 17.1% and Model4 only 3.1%. Furthermore, Model6 incorporates the highest number of significant variables in model making it more robust and Explanatory than any of the other models. Hence, due to the aforementioned reasons we believe that Model6 is the best Model that should be used for further analysis.

The number of bathrooms is significant at the 1% level and “accomodates” is significant at the 0.1% level. Similarly, the variable “bedroom” is also significant at the 5% level in model 5. This indicates that the independant variables do improve the regression model and thus allow for a more sound prediction of the expected price. The accomodation variable can be interpreted as - an increase in the individuals residing leads to a 14.9% rise in price. Similarly, an increase in the bedroom and bathroom leads to a 10.2% and 11.9% increase in the relevant price respectively. This is logical given that more 2 bedrooms corressponds to a larger flat/hotel and thus is likely to cost more. Controlling for “host being a super-host” also suggests that a super-host tends to charge a 5.6% higher price than a normal host, however, this coefficient is statistically insignificant. Analogous to being a super host, the variable “instant book” is also statistically insiginicant at the 5% level and hence does not appear to have a large impact on the dependant variable (price_4_nights) The 4 variables “neighbourhood simplified south, north, east and west” are streamlined to capture the effect of location on the dependant variable. As showcased in the table above, all 4 variables are statistically significant at the 0.1% level. The coefficients are all negative which suggests that, for example, staying in a listing located in the north leads to a 28.1% decline in expected price for 4 nights. This also appears to be logical as, people using airbnb in Amsterdam for 4 nights are likely to be tourists and thus will target a listing in central Amsterdam. This lack of demand for staying outside the central region indicates the negative effect of location on price. Lastly, the variable “availability_30” is also statistically significant at the 0.1% level and thus suggests that the availability of the listing does have an impact on the listings price. Surpirisngly, the impact of the variable “listings_per_month” is insignificant and thus does not appear to have a commanding impact on the listings price. This could be due to several factors - tourists may be more concerned with the convenience and the location of the listing rather than the reviews, similarly, tourists could also potentially use Amsterdam as a ‘gateway to western Europe’ and thus the reviews are unlikely to have a significant impact on the listings price

  1. Create a summary table, using huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
  2. Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.

5 Deliverables

  • By midnight on Monday 17 Oct 2022, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the memo:
  1. Executive Summary: Based on your best model, indicate the factors that influence price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.
  2. Data Exploration and Feature Selection: Present key elements of the data, including tables and graphs that help the reader understand the important variables in the dataset. Describe how the data was cleaned and prepared, including feature selection, transformations, interactions, and other approaches you considered.
  3. Model Selection and Validation: Describe the model fitting and validation process used. State the model you selected and why they are preferable to other choices.
  4. Findings and Recommendations: Interpret the results of the selected model and discuss additional steps that might improve the analysis

Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates; in other words, (i.e. I don’t want to see stuff I wrote in your final report.)

6 Acknowledgements